Abstract

Intrajudge Inconsistency in Standard Setting 

In judgmental standard setting experiments, it may be difficult to specify subjective 
probabilities that adequately take the properties of the items into account. As a result,
these probabilities are not consistent with each other in the sense that they do not refer 
to the same borderline level of performance. Methods to check standard setting data 
for intrajudge inconsistencies are thus of paramount importance to setting meaningful 
standards. This paper presents a method of consistency analysis for standard setting 
experiments in which judges specify probabilities for each response alternative of the
items. The method is based on a residual diagnosis of the subjective probabilities under
the hypothesis of a consistent judge. An empirical example shows
how the method can be used to identify sources of inconsistency in response alternatives, 
items, or judges. 

Keywords: Angoff Method; Interdependent Evaluation of Alternatives; Intrajudge 
Inconsistency; Polytomous Response Models; Nedelsky Method; Standard Setting. 
Introduction

The question of how to justify standards for educational tests or assessments has been a 
persistent source of debate among educators and measurement specialists. The dominant
view since a discussion in a 1978 special issue of the Journal of Educational Measurement
(Glass, 1978; Hambleton, 1978; Popham, 1978) is that standards cannot be justified by
an independent criterion but that their justification should follow from an evaluation of 
the procedure used to set them. Though the question of what constitutes a good standard 
setting method still is controversial, it seems safe to assume that for judgmental standard 
setting methods the requirement of the judgments being consistent with the objective 
properties of the test items has universal validity. 

Several types of inconsistent judgmental behavior in standard setting are possible
(van der Linden, 1996). For example, in an experiment in which judges evaluate 
actual test booklets of examinees, they may specify that Booklet A demonstrates more
proficiency than Booklet B, Booklet B more than Booklet C, but Booklet C more than
Booklet A. Likewise, in an Angoff (1971) experiment, a judge may specify higher
probabilities of success for items that are more difficult. This type of inconsistency
parallels the one in a Nedelsky (1954) experiment with a judge eliminating more options
for a more difficult item. Inconsistency may also happen in standard setting experiments 
with two tests or assessment instruments. If the standard set on one instrument cannot
be predicted from the one on the other by a (possibly nonlinear) regression equation 
that excellently fits the bivariate distribution of response data for both instruments, these 
standards are inconsistent. 

Each of these examples points to inconsistent behavior in standard setting that can 
be characterized as intrajudge inconsistency. In addition, the term interjudge inconsistency
has been used to describe differences in standards between judges in the same experiment. 
Though analyses of interjudge consistency may be useful in standard setting experiments 
where judges are provided with meticulously defined performance levels, the requirement 
that different judges should set the same standard does not have the universal validity the 
requirement of intrajudge consistency has. 

It is the purpose of this paper to introduce a method for analyzing intrajudge 
inconsistencies in standard setting experiments where the judges are required to specify 
probabilities of an examinee functioning at the borderline level of performance for each
response alternative of the items. The standard setting method used in the empirical 
example was the method of interdependent evaluation of all response alternatives (IDEA) 
(Chang, van der Linden & Vos, 2001). The method combines the positive features of the
Angoff and Nedelsky methods and avoids their negative features. Like the Nedelsky method, it forces the
judges to look into all alternatives and to evaluate the behavior of borderline examinees
with respect to each of them. On the other hand, like the Angoff method, it allows
judges to specify probabilities on the full interval [0,1] and avoids the problems of
discreteness inherent in the Nedelsky method. However, the standard is calculated only
from the probabilities for the correct alternative. Though not highlighted in this paper, 
the proposed method of analyzing intrajudge inconsistency can also be used in standard 
setting experiments with polytomously scored items. The only difference would be a
possibly different choice of the item response theory (IRT) model used in the empirical
example below. 

The method is based on the technique of residual analysis in statistics. It requires that a
model for the probabilities on the alternatives be fitted to response data from a representative
set of examinees and then analyzes the residual probabilities under the hypothesis of
consistent judgments. This type of analysis was introduced in van der Linden (1982;
see also Kane, 1987) for standard setting experiments based on the Angoff or Nedelsky
method. The current paper generalizes the applicability of the analysis to standard setting 
experiments that exploit the full set of response alternatives. One of the advantages of 
this generalization is that it is now possible not only to identify sources of inconsistency 
that reside in the judge or in the items but also in specific response alternatives or in 
interactions between judges and specific alternatives. In fact, a surprising finding in 
the empirical example in this paper is that nearly all judges had systematically greater 
difficulty dealing with the correct than with the incorrect alternatives of the items. 

Definitions and Notation 

The test items used in the standard setting experiment are denoted as $i = 1, \ldots, n$, with
the response alternatives for item $i$ denoted as $k_i = 1, \ldots, m_i$. A separate notation is
needed for the correct and wrong alternatives of the items. The correct alternative of item
$i$ is denoted as $g_i$, while an arbitrary incorrect alternative is denoted as $w_i$. The items
are assumed to measure a (unidimensional) variable $\theta$ representing the performances of
the examinees. Each of the judges $j = 1, \ldots, N$ is asked to choose a standard for the
performance level required from the examinees. The standard for judge $j$ is denoted as
a cutoff score $\theta_{cj}$. Observe that the standards are indexed by $j$ because different judges
may have different standards. If the standard setting experiment is required to result in a
single standard for the whole panel of judges, some form of consensus making has to be
introduced in the standard setting process that results in a common choice $\theta_{cj} = \theta_c$ for all
judges. Alternatively, a statistical operation, for instance, averaging of individual cutoff
scores, can be used to combine all individual standards into a single standard.

For each item the judges are required to specify the probabilities of an examinee
operating at performance level $\theta_{cj}$ producing response $X_i = k_i$ on item $i$. Because this
probability is the result of a judgmental process, it is denoted as $p^s_{kij}$, with superscript $s$
to indicate its subjective nature. The fact that these probabilities are required to sum to 1
forces the judges to coordinate their specifications between the alternatives.



IRT Model 



If the response data for the population of examinees fit an IRT model for items with
a polytomous response format, we also have objective probabilities for response $X_i = k_i$
by an examinee at performance level $\theta_{cj}$. In the empirical example below, Thissen and
Steinberg’s model for multiple-choice items (Thissen & Steinberg, 1984, 1997) was fitted
to the data. The model defines the probability of an examinee at $\theta_{cj}$ producing response
$X_i = k_i$ as:

$$
p_{kij} \equiv \Pr\{X_i = k_i \mid \theta_{cj}\} = \frac{\exp\{a_{ki}(\theta_{cj} - b_{ki})\} + d_{ki}\exp\{a_{0i}(\theta_{cj} - b_{0i})\}}{\sum_{h=0}^{m_i} \exp\{a_{hi}(\theta_{cj} - b_{hi})\}}, \tag{1}
$$

where $b_{ki}$ and $a_{ki}$ are the location and discriminating power of alternative $k$ of item
$i$, respectively. The model, which generalizes Bock’s (1997) nominal response model,
was chosen because of its flexibility to deal with guessing on multiple-choice items.
It does so by assuming that among the examinees who give response $k_i$ to item $i$ an (a
priori unknown) proportion $d_{ki}$ guesses ($\sum_{h=1}^{m_i} d_{hi} = 1$). The process of guessing is not
assumed to be blind but to be dependent on $\theta$ with probabilities given by
$\exp\{a_{0i}(\theta_{cj} - b_{0i})\} / \sum_{h=0}^{m_i} \exp\{a_{hi}(\theta_{cj} - b_{hi})\}$, with $b_{0i}$ and $a_{0i}$ denoting the location and discriminating
power of the response function for the examinees who guess.

When the model in (1) is fitted to data from achievement tests, the response function
for the correct alternative should be monotone in $\theta$. In the application below, the validity
of this assumption is tested against the alternative of a nonmonotone response function.
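
As an illustration of how the response probabilities in (1) can be evaluated, the following minimal Python sketch computes them for a single item. The function name `mc_model_probs`, the argument layout (index 0 holding the parameters of the guessing category), and all parameter values are assumptions made for this example; they are not taken from the calibration reported in this paper.

```python
import math

def mc_model_probs(theta, a, b, d):
    """Response probabilities for one item under the model in (1).

    a, b: sequences of length m + 1; index 0 is the guessing category,
          indices 1..m are the observable response alternatives.
    d:    sequence of length m with the guessing proportions (sums to 1).
    Returns a list with the probabilities for alternatives 1..m.
    """
    if abs(sum(d) - 1.0) > 1e-9:
        raise ValueError("guessing proportions d must sum to 1")
    # Exponential terms exp{a_h (theta - b_h)} for h = 0, ..., m.
    z = [math.exp(a_h * (theta - b_h)) for a_h, b_h in zip(a, b)]
    denom = sum(z)
    # Numerator for alternative k: its own term plus its share of the guessers.
    return [(z[k] + d[k - 1] * z[0]) / denom for k in range(1, len(z))]

# Hypothetical parameters for one item with three alternatives.
a = [-1.0, 1.2, -0.4, 0.2]
b = [0.0, 0.5, -0.3, 0.1]
d = [0.4, 0.3, 0.3]
probs = mc_model_probs(0.0, a, b, d)
```

Because the guessing proportions sum to 1, the probabilities returned for the observable alternatives sum to 1 at every value of theta.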



Error Definition 



Observe that $\theta_{cj}$ should be calculated from the subjective probabilities provided by judge
$j$ under the hypothesis of consistent judgments. The assumption is typical of the technique
of residual analysis used in this paper. The steps in this technique are: First, a model for
the probabilities on the alternatives is fitted to the response data from a representative set
of examinees. In the application below, the model is the one specified in (1). Second,
under the null hypothesis of a consistent judge a cutoff score is fitted to the subjective
probabilities of the judge. Third, the residuals, that is, the differences between the
objective probabilities from the model and the subjective probabilities from the judge,
are calculated. Fourth, the residuals are analyzed for inconsistencies, and potential
explanations of the inconsistencies are developed.

For the current response model in (1), the calculation of the cutoff score $\theta_{cj}$ for judge
$j$ in the second step is based on the following operations:

1. Summing the subjective probabilities $p^s_{gij}$ for the correct alternatives over the items
in the test;

2. Summing the objective probabilities for the correct alternatives over the items in
the test;

3. Equating the two sums and calculating $\theta_{cj}$ as the root of the equation.

That is, $\theta_{cj}$ is calculated as the root of:

$$
\sum_{i=1}^{n} p^s_{gij} = \sum_{i=1}^{n} \frac{\exp\{a_{gi}(\theta_{cj} - b_{gi})\} + d_{gi}\exp\{a_{0i}(\theta_{cj} - b_{0i})\}}{\sum_{h=0}^{m_i} \exp\{a_{hi}(\theta_{cj} - b_{hi})\}}. \tag{2}
$$

The error by judge $j$ on alternative $k$ of item $i$ is thus equal to the residual probability

$$
e_{kij} = p^s_{kij} - p_{kij}. \tag{3}
$$
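
The second and third steps of the residual analysis can be sketched in Python as follows. The sketch finds the cutoff score by simple bisection and then computes the residuals for the correct alternatives only. The function names, the search interval, the item parameters, and the subjective probabilities are all hypothetical; bisection is used here only for illustration and is not necessarily the root-finding method used by the authors. The sketch assumes that the sum of the objective probabilities crosses the sum of the subjective probabilities within the search interval.

```python
import math

def p_correct(theta, a, b, d_g, g):
    """Objective probability of correct alternative g under the model in (1).

    a, b hold the parameters for h = 0, ..., m (index 0 = guessing category);
    d_g is the guessing proportion attached to the correct alternative.
    """
    z = [math.exp(a_h * (theta - b_h)) for a_h, b_h in zip(a, b)]
    return (z[g] + d_g * z[0]) / sum(z)

def fit_cutoff(subjective, items, lo=-6.0, hi=6.0, tol=1e-10):
    """Solve the equation in (2) for the cutoff score by bisection on [lo, hi]."""
    target = sum(subjective)

    def gap(theta):
        # Sum of objective probabilities minus sum of subjective probabilities.
        return sum(p_correct(theta, *item) for item in items) - target

    if gap(lo) > 0 or gap(hi) < 0:
        raise ValueError("no sign change on the search interval")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical item parameters: (a, b, d_g, g) per item.
items = [
    ([-1.0, 1.2, -0.4, 0.2], [0.0, 0.5, -0.3, 0.1], 0.4, 1),
    ([-0.8, 1.0, 0.1, -0.3], [0.2, -0.2, 0.4, 0.0], 0.25, 1),
]
subjective = [0.7, 0.6]  # one judge's probabilities for the correct alternatives

theta_c = fit_cutoff(subjective, items)
objective = [p_correct(theta_c, *item) for item in items]
# Residuals: subjective minus objective probability at the fitted cutoff.
residuals = [ps - p for ps, p in zip(subjective, objective)]
```

Because the cutoff score is chosen to equate the two sums, the signed residuals on the correct alternatives sum to zero over the items, which is why the indices that follow work with absolute values.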
It is now possible to aggregate the error in (3) over response alternatives, items, or
judges. This aggregation results in inconsistency indices for (combinations of) judges and
items. We first introduce a set of unstandardized inconsistency indices and then indicate
how to standardize these indices to take possible values only in the interval [0,1].

Errors by Individual Judges 

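The absolute errors by judge $j$ on the correct and incorrect alternatives of item $i$ are given
by

$$
e_{gij} = |p^s_{gij} - p_{gij}| \tag{4}
$$

and

$$
e_{wij} = (m_i - 1)^{-1} \sum_{k=1;\, k \neq g}^{m_i} |p^s_{kij} - p_{kij}|, \tag{5}
$$

respectively.

Aggregating these errors over the items gives the following indices for the average
errors by judge $j$ on the correct, incorrect and across all alternatives at the level of the
test:

$$
e_{gj} = n^{-1} \sum_{i=1}^{n} |p^s_{gij} - p_{gij}|, \tag{6}
$$

$$
e_{wj} = \Big(\sum_{i=1}^{n} m_i - n\Big)^{-1} \sum_{i=1}^{n} \sum_{k=1;\, k \neq g}^{m_i} |p^s_{kij} - p_{kij}|, \tag{7}
$$

$$
e_{j} = \Big(\sum_{i=1}^{n} m_i\Big)^{-1} \sum_{i=1}^{n} \sum_{k=1}^{m_i} |p^s_{kij} - p_{kij}|. \tag{8}
$$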



This choice for absolute values of the errors is made to prevent them from 
compensating each other when they are aggregated within or between items or judges. 
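
A minimal Python sketch of the aggregation for a single judge is given below. It assumes that the averages for the incorrect alternatives and for all alternatives are taken over the pooled set of alternatives in the test (rather than as averages of per-item averages). The function name `judge_indices` and the data are hypothetical.

```python
def judge_indices(subjective, objective, correct):
    """Unstandardized error indices for one judge.

    subjective, objective: one list of probabilities per item.
    correct: 0-based index of the correct alternative per item.
    """
    e_g_total = 0.0  # summed absolute error on the correct alternatives
    e_w_total = 0.0  # summed absolute error on the incorrect alternatives
    n_wrong = 0      # total number of incorrect alternatives in the test
    for ps, p, g in zip(subjective, objective, correct):
        for k, (ps_k, p_k) in enumerate(zip(ps, p)):
            err = abs(ps_k - p_k)
            if k == g:
                e_g_total += err
            else:
                e_w_total += err
                n_wrong += 1
    n_items = len(correct)
    return {
        "correct": e_g_total / n_items,
        "incorrect": e_w_total / n_wrong,
        "all": (e_g_total + e_w_total) / (n_items + n_wrong),
    }

# Hypothetical probabilities for two items with three and four alternatives.
subjective = [[0.6, 0.3, 0.1], [0.5, 0.2, 0.2, 0.1]]
objective = [[0.5, 0.3, 0.2], [0.7, 0.1, 0.1, 0.1]]
correct = [0, 0]
indices = judge_indices(subjective, objective, correct)
```

In this example the judge's average absolute error is .15 on the correct alternatives and .06 on the incorrect alternatives.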
Errors by Panel of Judges

Test items differ in the likelihood of a judge making an error in his/her specification of 
the subjective probabilities. The reason for such differences may reside, for example, in 
sloppy behavior by the judge, but also in the formulation of the items, the difficulty of the 
correct alternative, or the familiarity of the judge with specific topics in the domain tested. 
Item analysis based on errors aggregated over the panel of judges can help to reveal the 
actual sources of such differences. 

The following equations give the average errors on the correct alternative, incorrect 
alternatives, and across all alternatives of item i across the panel of judges: 

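$$
e_{gi} = N^{-1} \sum_{j=1}^{N} |p^s_{gij} - p_{gij}|, \tag{9}
$$

$$
e_{wi} = N^{-1} (m_i - 1)^{-1} \sum_{j=1}^{N} \sum_{k=1;\, k \neq g}^{m_i} |p^s_{kij} - p_{kij}|, \tag{10}
$$

$$
e_{i} = (N m_i)^{-1} \sum_{j=1}^{N} \sum_{k=1}^{m_i} |p^s_{kij} - p_{kij}|. \tag{11}
$$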

Analogous to (6)-(8), the errors by a panel of judges can be aggregated over all items 
in the test. These aggregates can be used, for example, to detect differences between the 
error levels for the correct and incorrect alternatives or the general error level for the panel 
of judges on the test. The equations are: 



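$$
e_{g} = (Nn)^{-1} \sum_{j=1}^{N} \sum_{i=1}^{n} |p^s_{gij} - p_{gij}|, \tag{12}
$$

$$
e_{w} = N^{-1} \Big(\sum_{i=1}^{n} m_i - n\Big)^{-1} \sum_{j=1}^{N} \sum_{i=1}^{n} \sum_{k=1;\, k \neq g}^{m_i} |p^s_{kij} - p_{kij}|, \tag{13}
$$

$$
e = N^{-1} \Big(\sum_{i=1}^{n} m_i\Big)^{-1} \sum_{j=1}^{N} \sum_{i=1}^{n} \sum_{k=1}^{m_i} |p^s_{kij} - p_{kij}|. \tag{14}
$$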
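
The panel-level item analysis can be sketched in Python as follows, here only for the correct alternatives. The sketch averages the absolute residuals per item across judges and ranks the items by their average error, so that items with outlying errors stand out. The function name `item_errors` and the residuals are hypothetical and do not reproduce the values in the tables of this paper.

```python
def item_errors(abs_resid):
    """Average absolute residual per item across judges.

    abs_resid: one row per judge, one column per item.
    """
    n_judges = len(abs_resid)
    n_items = len(abs_resid[0])
    return [sum(row[i] for row in abs_resid) / n_judges for i in range(n_items)]

# Hypothetical absolute residuals on the correct alternative: 3 judges, 4 items.
abs_resid = [
    [0.10, 0.40, 0.15, 0.05],
    [0.20, 0.30, 0.10, 0.10],
    [0.15, 0.35, 0.20, 0.15],
]
per_item = item_errors(abs_resid)
# Overall error level of the panel on the correct alternatives.
overall = sum(per_item) / len(per_item)
# Item positions (0-based) ordered from largest to smallest average error.
ranked = sorted(range(len(per_item)), key=lambda i: per_item[i], reverse=True)
```

In this made-up data set the second item has by far the largest average error across judges, which would make it the first candidate for closer inspection.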



Standardized Consistency Indices 



The above inconsistency indices should be used descriptively. The problem of how to 
specify a statistical test for the hypothesis of consistent judgments is still hampered by the
fact that it appears to be difficult to formulate a valid statistical model for the distribution
of the subjective probabilities $p^s_{kij}$ across replications. To support the comparison of
errors between items, judges, or occasions, it is therefore important to have standardized 
versions of the indices that have a common range of possible values. 

Standardization of the above indices to indices that can take values in the full interval
[0, 1] is achieved through the following transformation

$$
C = \frac{M_e - e}{M_e}, \tag{15}
$$

where $e$ is a generic symbol for the inconsistency indices and $M_e$ is the maximum value
of the index possible. The maximum is found if index $e$ is calculated with the expression
$p^s_{kij} - p_{kij}$ in (3) replaced by

$$
\max\{p_{kij}, 1 - p_{kij}\}. \tag{16}
$$


Because the calculations are straightforward, no equations for the consistency indices are 
given. 

The main purpose for standardizing the residuals is to make them independent of the
objective probabilities of success at the performance level of the borderline examinee,
$\theta_{cj}$. The maximum residual in (16) varies as a function of $\theta$, whereas index $C$ does not.
Observe that the direction of $C$ is also opposite to the direction of $e$. $C$ should therefore
be considered as a consistency index; the closer its value to 1, the more consistent the
judgments. The maximum $C = 1$ is obtained if at $\theta_{cj}$ it holds that $p^s_{kij} = p_{kij}$ for all
alternatives, items, and/or judges over which the index is defined.
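
The standardization can be sketched in Python for any set of subjective and objective probabilities over which an index is defined. The function name `consistency_index` and the data are hypothetical. The sketch relies on the fact that the index and its maximum are averages over the same set of probabilities, so that the normalizing constants cancel in the ratio.

```python
def consistency_index(subjective, objective):
    """Standardized consistency index C = (M_e - e) / M_e.

    subjective, objective: flat lists of probabilities over which the
    underlying inconsistency index is defined.
    """
    # Summed absolute residuals (the common averaging constant cancels).
    e = sum(abs(ps - p) for ps, p in zip(subjective, objective))
    # Largest possible absolute residual for each objective probability.
    m = sum(max(p, 1.0 - p) for p in objective)
    return (m - e) / m

# Hypothetical probabilities for the correct alternatives of two items.
objective = [0.5, 0.7]
c_example = consistency_index([0.6, 0.5], objective)
c_perfect = consistency_index([0.5, 0.7], objective)   # no residuals
c_worst = consistency_index([1.0, 0.0], objective)     # maximal residuals
```

In this example the index is .75 for the first set of judgments, 1 when the subjective and objective probabilities coincide, and 0 when every subjective probability is as far from the objective one as possible.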
Empirical Example

A standard setting experiment was conducted in which eight judges used the method of
interdependent evaluation of alternatives (IDEA) to set a pass-fail standard on a test
of German as a second language consisting of items used previously in a national school 
leaving exam at the end of secondary education in the Netherlands. The purpose of this 
small experiment was not to set an actual standard for the exam or to assess the typical 
error level of judges operating in a standard setting experiment, but only to illustrate the 
type of residual analysis of intrajudge inconsistency advocated in this paper. 

The original test used in the national exam had 29 items with 3-5 response alternatives. 
The items were calibrated under the model in (1) using the response data from 16,1648 
examinees and the software program Multilog (Thissen, 1991). The goodness of fit of the 
model was assessed against both a less restrictive and a more restrictive model fitted on the
same data set. The direct likelihood-ratio test of the model against the general multinomial 
alternative in Multilog could not be used because the number of examinees was of much 
smaller order than the number of possible response patterns ($1.938 \times 10^{11}$). For the use
of such alternative goodness-of-fit tests, see Thissen and Steinberg (1984). The less 
restrictive model was Mokken’s (1997) nonparametric response model. This model was 
used to check the items for unidimensionality of $\theta$ as well as monotonicity of the response
function for the correct alternative using the software program MSP 5 (Molenaar &
Sijtsma, 2000). A set of 19 items yielded a scalability coefficient $H = .14$, which is to
be considered as a conservative value (Molenaar & Sijtsma, 2000). Because the Mokken 
model does not assume any parametric form for the response functions, it follows that 
the data support these two critical assumptions. The assumption of monotonicity of the 
response functions for the correct alternatives is particularly important because the model 
in (1) was applied to achievement test items. The more restrictive model was the nominal 
response model (Bock, 1997). For the same set of items, a likelihood-ratio test showed 
that this model had to be rejected in favor of the model in (1) ($p < .001$). This set was
therefore used in the experiment. 

The judges were secondary school teachers with an average of 8.5 years of experience 
in teaching German. The judges were trained using realistic exercises until each of them
declared themselves competent in the task. It is believed that both the selection and training of
the judges qualify them for the standard setting experiment (Raymond & Reid, 2001).

[Table 1 and 2 about here] 

Summaries of the (aggregated) residuals for the correct and incorrect alternatives are 
given in Tables 1 and 2, respectively. A consistent trend in the two tables is the difference
between the residuals for the correct and incorrect alternatives. The average residual
across all judges and items is .18 for the correct and .11 for the incorrect alternatives. The
ranges for the average residuals per judge are remarkably small: (.16-.23) for the correct
alternatives and (.10-.13) for the incorrect alternatives. The difference in range can be
explained by the fact that the results for the incorrect alternatives are based on an extra
step of averaging (the number of incorrect alternatives was 2-4).

The average residuals per item are also in a relatively small range for the incorrect
alternatives, (.07-.18). However, the average residuals per item for the correct alternatives
showed two outlying results: .35 for Item 2 and .45 for Item 12. If these results are
neglected, the range runs from .07 to .22.

A comparison between the residuals for Items 2 and 12 in Tables 1-2 shows that both
are uniformly high across judges for the correct alternative. Item 12 also shows uniformly
high residuals for the incorrect alternatives, whereas Item 2 shows results for the judges that
are not systematically larger than those for the other items. There are two reasons
why residuals can be large: (1) attributes specific to the item that make it difficult to
specify subjective probabilities for one or more of its alternatives; and (2) the dependency
of the residuals on $\theta_{cj}$.

[Table 3 and 4 about here] 

The latter explanation can be rejected if the analysis is based on standardized
consistency indices. Tables 3 and 4 show the values of these indices for the same
items and judges. A comparison between these two sets of tables seems to support the
hypothesis that the results for Item 12 are due to the attributes of the item, in particular,
attributes of the correct alternative (the values for the incorrect alternatives do not show
any remarkable pattern). The results for Item 2 were more in line with those for the
other items (albeit that the average consistency across judges is among the lowest values).
Going back to the response data for the examinees, the authors found that the p-value
for these examinees (.40) was lower than the a-value for one of the incorrect alternatives
(.49). This observation may suggest an ambiguity in the correct alternative. In a real-life
application of this method, with feedback to the judges, the next step would be to ask the
judges to discuss this alternative. If the conclusions did not converge, or if they indicated
a technical error in this alternative, the natural decision would be to remove the item from
the set and ask the judges to reconsider their subjective probabilities on the other items
only.



Concluding Observations 

A systematic trend in the results in Tables 1-4 is more consistent behavior for the incorrect
than for the correct alternatives of the items. This trend seems to hold for nearly every judge
(the only clear exception is Judge 5). The fact that this trend holds for the standardized
consistency indices as well as for the residuals seems to exclude explanations based on
differences in response probabilities for examinees performing at the cutoff scores $\theta_{cj}$.
As a tentative explanation, it is suggested that correct alternatives are more difficult to
comprehend than incorrect alternatives and that the judges were therefore less capable of
specifying probabilities of success on items.

It is not known if this trend generalizes to other content domains. If it does,
an interesting practical conclusion would be to set standards using probabilities on the
incorrect rather than the correct alternatives. The standard on the $\theta$ scale should then be
calculated from a version of (2) in which the sums on the left- and right-hand sides are defined
over the most consistent incorrect alternative, or as an average over a subset of consistent
incorrect alternatives. The cutoff score on the number-correct scale would then
follow from the one on the $\theta$ scale via the right-hand side of the current version of (2).
This method would amount to a continuous version of the Nedelsky technique. In fact,
the method of interdependent evaluation of alternatives used in the empirical example is
flexible enough to make a post hoc decision on what probabilities to use in the definition
of the sums in (2), that is, after all probabilities have been obtained and it is known on
which alternative the judges have operated most consistently.

In the empirical example, the objective probabilities $p_{kij}$ were calculated using
estimates for the parameters in the response model in (1). Though it is possible to calculate
confidence intervals or posterior highest density intervals for the values in Tables 1-4 to
account for estimation error, this was not done. The number of examinees used to fit the 
model to the response data was large enough to ignore estimation error. Also, with such 
intervals, the users may be tempted to interpret the results as if they are from statistical 
tests, while, as indicated earlier, they should be used only to describe the consistency of 
judges in the standard setting experiment. 

As alluded to earlier, implementations of standard setting experiments in
real life typically have several stages in which judges are encouraged to reconsider their
subjective probabilities based on feedback they receive from the facilitator of the process
(Reckase, 2001). The proposed use of the residual analysis introduced in this paper is as
part of this type of feedback in a multi-stage experiment. It is believed that the format
used in Tables 1-4 is easy for the judges typically used in such experiments to understand.